04: Filter and Select

Overview

This tutorial covers two important {dplyr} functions: filter() and select(). Easy to confuse, filter() uses logical assertions to return a subset of rows (cases) in a dataset, while select() returns a subset of the columns (variables) in the dataset.

Tip

To remember which does which:

  • filter() works on rows, which starts with “r”, so it contains the letter “r”.
  • select() works on columns, which starts with “c”, so it contains the letter “c”.

Setup

Packages

We will be focusing on {dplyr} today, which contains both the filter() and select() functions. You can either load {dplyr} alone, or all of {tidyverse} - it won’t make a difference, but you only need one or the other.

Exercise

Load the necessary packages.

library(dplyr)
## OR
library(tidyverse)

Data

Today we’re going to start working with a dataset that we’re going to get familiar with over the next few weeks. Courtesy of fantastic Sussex colleague Jenny Terry, this dataset contains real data about statistics and maths anxiety.

Exercise

Read in the dataset and save it in a new object, anx_data.

On the Cloud, you can read in this dataset from the data folder using here::here().

Elsewhere, you can download the dataset, or copy the dataset URL, from the Data and Workbooks page.

Read in from file:

anx_data <- readr::read_csv(here::here("data/anx_data.csv"))

Read in from URL:

anx_data <- readr::read_csv("https://raw.githubusercontent.com/drmankin/practicum/master/data/anx_data.csv")

Codebook

There’s quite a bit in this dataset, so you will need to refer to the codebook below for a description of all the variables.

This study explored the difference between maths and statistics anxiety, widely assumed to be different constructs. Participants completed the Statistics Anxiety Rating Scale (STARS) and Maths Anxiety Rating Scale - Revised (R-MARS), as well as modified versions, the STARS-M and R-MARS-S. In the modified versions of the scales, references to statistics and maths were swapped; for example, the STARS item “Studying for an examination in a statistics course” became the STARS-M item “Studying for an examination in a maths course”; and the R-MARS item “Walking into a maths class” because the R-MARS-S item “Walking into a statistics class”.

Participants also completed the State-Trait Inventory for Cognitive and Somatic Anxiety (STICSA). They completed the state anxiety items twice: once before, and once after, answering a set of five MCQ questions. These MCQ questions were either about maths, or about statistics; each participant only saw one of the two MCQ conditions.

Important

For learning purposes, I’ve randomly generated some additional variables to add to the dataset containing info on distribution channel, consent, gender, and age. Especially for the consent variable, don’t worry: all the participants in this dataset did consent to the original study. I’ve simulated and added this variable in later to practice removing participants.

Variable Type Description
id Categorical Unique ID code
distribution Categorical Channel through which the study was completed, either "preview" or "anonymous" (the latter representing "real" data). Note that this variable has been randomly generated and does NOT reflect genuine responses.
consent Categorical Whether the participant read and consented to participate ("Yes") or not ("No"). Note that this variable has been randomly generated and does NOT reflect genuine responses; all participants in this dataset did originally consent to participate.
gender Categorical Gender identity, one of "female", "male", "non-binary", or "other/pnts". "pnts" is an abbreviation for "Prefer not to say". Note that this variable has been randomly generated and does NOT reflect genuine responses.
age Numeric Age in years. Note that this variable has been randomly generated and does NOT reflect genuine responses.
mcq Categorical Independent variable for MCQ question condition, whether the participant saw MCQ questions about mathematics ("maths") or statistics ("stats").
stars_[sub][number] Numeric Item on the Statistics Anxiety Rating Scale. There are three subscales, denoted with [sub] in the name:
- [test]: Test anxiety
- [help]: Asking for Help
- [int]: Interpretation Anxiety.
[num] corresponds to the item number. Responses given on a Likert scale from 1 (no anxiety) to 5 (a great deal of anxiety), so higher scores reflect higher levels of anxiety.
stars_m_[sub][number] Numeric Item on the Statistics Anxiety Rating Scale - Maths, a modified version of the STARS with all references to statistics replaced with maths. There are three subscales, denoted with [sub] in the name:
- [test]: Test anxiety
- [help]: Asking for Help
- [int]: Interpretation Anxiety.
[num] corresponds to the item number. Responses given on a Likert scale from 1 (no anxiety) to 5 (a great deal of anxiety), so higher scores reflect higher levels of anxiety.
rmars_[sub][number] Numeric Item on the Revised Maths Anxiety Rating Scale. There are three subscales, denoted with [sub] in the name:
- [test]: Test anxiety
- [num]: Numerical Task Anxiety
- [course]: Course anxiety.
[num] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 5 (very much), so higher scores reflect higher levels of anxiety.
rmars_s_[sub][number] Numeric Item on the Revised Maths Anxiety Rating Scale - Statistics, a modified version of the MARS with all references to maths replaced with statistics. There are three subscales, denoted with [sub] in the name:
- [test]: Test anxiety
- [num]: Numerical Task Anxiety
- [course]: Course anxiety.
[num] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 5 (very much), so higher scores reflect higher levels of anxiety.
sticsa_trait_[number] Numeric Item on the State-Trait Inventory for Cognitive and Somatic Anxiety, Trait subscale. [num] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 4 (very much so), so higher scores reflect higher levels of anxiety.
sticsa_[time]_state_[number] Numeric Item on the State-Trait Inventory for Cognitive and Somatic Anxiety, State subscale. [time] denotes one of two times of administration: before completing the MCQ task ("pre"), or after ("post"). [num] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 4 (very much so), so higher scores reflect higher levels of anxiety.
mcq_stats_[num] Categorical Correct (1) or incorrect (0) response to MCQ questions about statistics, covering mean ([number] = 1), standard deviation (2), confidence intervals (3), beta coefficient (4), and standard error (5).
mcq_maths_[num] Categorical Correct (1) or incorrect (0) response to MCQ questions about maths, covering mean ([number] = 1), standard deviation (2), confidence intervals (3), beta coefficient (4), and standard error (5).

Thinking Like a Coder

From this point onwards, we’re going to start digging into some data wrangling skills. One aspect of this process that is often difficult for people new to coding and/or accustomed to a point-and-click interface is the separation between you and the dataset. That is, you can’t “see” what’s happening as you make changes, or edit your dataset directly; instead, you write code and execute it without being able to observe the changes being made directly. When I was learning R, this felt like a glass wall between me and my dataset that I found very frustrating.

To help with this, we’ll start explicitly practicing a recursive process for any task we want to accomplish in R. It goes something like this:

  • Anticipate. Understand clearly what you want your code to accomplish, and have an idea (at least roughly, but preferably in some detail) of what the output will be when the task has been executed correctly.
  • Code. Write the code to accomplish the task.
  • Evaluate. Compare the output you have actually produced to the output you anticipated producing and judge whether the code has done what it was meant to do or not.

With these three steps, you can ACE any task in R!1

Especially if you are new to coding, I strongly recommend you practice these steps explicitly. For simpler examples or exercises, it might seem trivial. However, this iterative process is absolutely essential to accurate, safe, and efficient coding, and it will become second nature over time.

When I say “safe” code, I don’t mean that your computer will explode. Just as in data wrangling using any application or method, there’s always a risk of making mistakes. When you are coding, however, certain types of mistakes are easier to miss, because the output of your commands is often “hidden” (stored) inside objects. It is definitely possible - and happens often! - that code will run successfully, and even appear to produce the right output, but has in reality not done what you intended it to do. (There’s a great, if complicated, example of this just a bit further down.)

When I say “safe” code, I mean code that is resilient to such mistakes, and has been checked carefully to ensure that it does only and exactly what it is meant to do. In other words, code that you are (at least reasonably!) sure will produce the same results again when you run it again in the future, and that what it does is what you intended. This recursive Anticipate-Code-Evaluate process is the method I use to ensure my code is safe.

Filter

The filter() function’s primary job is to easily and transparently subset the rows within a dataset - in particular, a tibble. The filter() function takes one or more logical assertions and returns only the rows for which the assertion is TRUE. Columns are not affected by filter(), only rows.

General Format

1dataset_name |>
2  dplyr::filter(
3    logical_assertion
  )
1
Take the dataset dataset_name, and then
2
Filter it keeping only the cases where the following assertion is true:
3
A logical assertion about the variable(s) in dataset_name that returns logical (TRUE or FALSE) values.

Filtering with Assertions

The logical_assertion in the general format above is just like the assertions we saw in the first tutorial. The rows where the assertion returns TRUE will be included in the output; those that return FALSE will not. Inside the filter() command, use the names of the variable in the piped-in dataset to create the logical assertions.

As a first example, let’s subset our anx_data tibble to only contain people who completed the maths MCQs. We might then anticipate that if we accomplish this task correctly, we will have the same variables in the dataset that we had before, but only the rows where the mcq variable contains the string "maths".

To do this, we’ll of course use filter(), along with a familiar operator from the first tutorial, to write the logical assertion about a variable that exists in the anx_data tibble, namely mcq. Here’s the code:

1anx_data |>
  dplyr::filter(
2    mcq == "maths"
    )
1
Take the dataset anx_data, and then filter it keeping only the cases where the following assertion is true:
2
The value in the mcq variable is exactly and only equal to "maths".

Finally, let’s evaluate the output from this command. We do indeed get all the same variables (columns) as before - but the number of rows is smaller than it was previously. Scroll along to the mcq variable, and you will see that each row contains the same value: "maths". So, our output contains only the rows (cases) from the original tibble that have the value "maths" in the mcq variable, and NOT "stats" nor any NAs (because NA does not equal "maths"!) Success!

Remember that for exact matches like this, we must use double-equals == and not single-equals =. If you use single equals, you’re not alone - this is such a common thing that the (incredibly friendly and helpful) error message tells you what to do to fix it!

anx_data |> 
  dplyr::filter(mcq = "maths")
Error in `dplyr::filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `mcq == "maths"`?

Naturally, we can also filter on numeric values. For our next example, let’s say we only want to retain participants from the original dataset who were younger than 40 years old. So, we’ll anticipate that the resulting dataset will only contain values in the age variable less than 40, and then write the code to do so:

1anx_data |>
  dplyr::filter(
2    age < 40
    )
1
Take the dataset anx_data, and then filter it keeping only the cases where the following assertion is true:
2
The value in the age variable is less than 40.

To evaluate this code, we can again look through the output and check whether each value in the age variable is less than 40.

This is a bit less easy to check by eye than the first example. If we wanted to do more in-depth checks to ensure our code worked correctly, we could carry on with our pipe and use some of the techniques we covered a couple tutorials ago.

anx_data |>
  # Perform the filtering as above
  dplyr::filter(
    age < 40
  ) |>
  # Get out all the values in the age variable in the output tibble
  dplyr::pull(age) |> 
  # Return the max value of those ages
  max()
[1] 39

Note that this works as a check for the filter() command because this is one continuous pipe. That is, the values in the age variable extracted by pull() are from the filtered dataset, because the input to pull() is the output from filter() in the preceding step.

As an extension of this, instead of using a specific value (like 40), we can also use values that we calculate using the data itself. For instance, let’s only retain participants as old as the median age of the sample, or younger. Here we can take advantage of the fact that we can use variable names as objects inside {dplyr} functions like filter()2. Then we write a logical assertion just like we have done in previous tutorials.

anx_data |> 
  dplyr::filter(
    age <= median(age, na.rm = TRUE)
  )

As a final example, let’s consider a situation where we want to retain only participants that gave a gender identity of either “male” or “female”.3

To do this, we need a new operator: %in%, which God knows I just pronounce as “in” (try saying “percent-in-percent” three times fast!). This looks for any matches with any of the elements that come after it:

1anx_data |>
  dplyr::filter(
2    gender %in% c("female", "male")
    )
1
Take the dataset anx_dat, and then filter it keeping only the cases where the following assertion is true:
2
The value in the gender variable matches any of the values “female” or “male”.

{#dangerous-example}

Why not ==?

What follows here is a rabbit hole that gets into some gritty detail. If you’re happy to take my word for it that you absolutely, definitely needed %in% and not == in the previous exercise, you can skip the explanation below. If you’re keen to understand all the nuance, click to expand and read on!

For this matching task, you might have thought we’d use gender == c("female", "male"), which runs successfully and sure looks okay. So why isn’t this right?

## DO NOT DO THIS
anx_data |> 
  ## THIS DOES NOT DO WHAT WE WANT!!
  dplyr::filter(gender == c("female", "male"))
## DANGER WILL ROBINSON

At a glance it looks like this produces the same output as the solution above - gender now contains only male or female participants. As you might have gathered from the all-caps comments above - intended to prevent you from accidentally using this code in the future for tasks like this - this is NOT what this code does.

To demonstrate what it does do, I need the dplyr::mutate() function from the next tutorial to create some new variables. The first new variable, double_equals, contains TRUEs and FALSEs for each case using the assertion with ==. The second is exactly the same, but reverses the order of the genders - something that should NOT make a difference to the matching! (We want either female OR male participants, regardless of which we happen to write first.) The third, in_op, contains the same again but this time with %in%. The final arrange() line sorts the dataset by gender to make the output easier to read.

anx_data |> 
  dplyr::mutate(
    double_equals = (gender == c("female", "male")),
    double_equals_rev = (gender == c("male", "female")),
    in_op = (gender %in% c("female", "male")),
    .keep = "used"
  ) |> 
  dplyr::arrange(gender)

Notice anything wild?

For participants with the same value in gender, the assertions with == both flip between TRUE and FALSE, but in the reverse pattern to each other. The assertion with %in% correctly labels them all as TRUE. WTF?

What’s happening is that because the vector c("female", "male") contains two elements, the assertion with == matches the first case to the first element - female - and returns TRUE. Then it matches the second case to the second element - male - and this time returns FALSE. Then because there are more cases, it repeats: the next (third) case matches female and returns TRUE, the next male and FALSE, and so forth. The == assertion with the gender categories reversed does the same, but starts with male first and female second. Only %in% actually does what we wanted, which was to return TRUE for any case that matches female OR male.

This is a good example of what I think of as “dangerous” code. I don’t mean “reckless” or “irresponsible” - R is just doing exactly what I asked it to do, and it’s not the job of the language or package creators to make sure my code is right. I mean dangerous because it runs as expected, produces (what looks like) the right output, and even with some brief checking, would appear to contain the right cases - but would quietly result in a large chunk of the data being wrongly discarded. If you didn’t know about %in%, or how to carefully double-check your work, you could easily carry on from here and think no more about it.

So, how can we avoid a problem like this? Remember the ACE process described at the start of this tutorial: Anticipate, Code, Evaluate. In this case, the Evaluate step is key to catch that although the == code runs successfully, it has not done what we wanted. Especially for new operators or functions, it’s worth spending a lot of time on evaluating the output and checking what the code has produced in several ways to be sure it’s done what you wanted (and NOT done anything you didn’t want).

There are lots of ways to evaluate the output of code you’ve written, and over time you’ll develop your own checks that make the most sense to you. For now, one option is the code I created above, with new columns for the different assertion options - but this might be something you’d only think to do if you already knew about %in% or suspected there was a problem. A more routine check might stem from the Anticipate step, where I might lay out what I want the output of my code to look like:

I expect that when my filtering is accomplished, my dataset will contain all and only the participants who reported a gender identy of female or male, and no others. I will also have the same number of cases as the original dataset, less the number of other gender categories.

First, I’ll create a new dataset using the filtered data.

## SERIOUSLY THIS IS BAD
anx_data_bd <- anx_data |> 
## DON'T USE THIS CODE FOR MATCHING
  dplyr::filter(gender == c("female", "male"))
## STOP OH GOD PLEASE JUST DON'T

Check 1: Filtered data contains only male and female participants.

anx_data_bd |> 
  dplyr::count(gender)

Only female and male participants! Tick ✅

At this point, though, I might become suspicious. The original dataset contained 465 cases - we’ve lost more than half! Can that be right? Better check the numbers.

## Get the numbers from the original dataset
anx_data |> 
  dplyr::count(gender)

Uh oh. Already we can see that something’s wrong with the numbers. But instead of relying on visual checks, let’s let R tell us.

## Calculate how many cases we expect if the filtering had gone right
expected_n <- anx_data |> 
  dplyr::count(gender) |> 
  ## This isn't the best way to filter
  dplyr::filter(gender != "non-binary") |>
  ## The next section on multiple assertions has a much better method!
  dplyr::filter(gender != "other/pnts") |> 
  dplyr::pull(n) |> 
  sum()

## Ask R whether the expected number of rows is equal to the actual number of rows in the filtered data
expected_n == nrow(anx_data_bd)
[1] FALSE

Now we know for sure there’s a problem and can investigate what happened more thoroughly.

As a final stop on this incredibly lengthy detour (are you still here? 👋), you might wonder whether the check above would give me the wrong answer, because I used two filter()s in a row, and the whole point of this goose chase is how to accomplish that exact filtering task. First, this is NOT the way I would do this (as the comments suggest), but I’m really trying to stick to ONLY what we’ve already covered wherever possible. But let’s say I’d tried to do this with the bad == filtering that caused all this faff in the first place.

For this particular case there are four values in gender. If I try gender == c("female", "male") here, this DOES actually work fine - because the categories are in the right order and are a multiple of the length of the dataset 🤦 But at least the numbers still wouldn’t match, which would tell me that something went wrong with filtering the whole dataset.

anx_data |> 
  dplyr::count(gender) |> 
  dplyr::filter(gender == c("female", "male"))

If I happened to have had the genders the other way round, I would have got an empty tibble, and hopefully that also would have clued me in that there was a problem with the original filtering.

anx_data |> 
  dplyr::count(gender) |> 
  dplyr::filter(gender == c("male", "female"))

Multiple Assertions

Logical assertions can also be combined to specify exactly the cases you want to retain. The two most important operators are:

  • & (AND): Only cases that return TRUE for all assertions will be retained.
  • | (OR): Any cases that return TRUE for at least one assertion will be retained.

Let’s look at a couple minimal examples to get the hang of these two symbols. For each of these, you can think of the single response R gives as the answer to the questions, “Are ALL of these assertions true?” for AND, and “Is AT LEAST ONE of these assertions true?” for OR.

First, let’s start with a few straightforward logical assertions:

"apple" == "apple"
[1] TRUE
23 > 12
[1] TRUE
42 == "the answer"
[1] FALSE
10 > 50
[1] FALSE

Next, let’s look at how they combine.

Two true statements, combined with &, return TRUE, because it is true that all of these assertions are true.

"apple" == "apple" & 23 > 12
[1] TRUE

Two true statements, combined with |, also return TRUE, because it true that at least one of these assertions is true.

"apple" == "apple" | 23 > 12
[1] TRUE

Two false statements, combined with &, return FALSE, because it is NOT true that all of them are true.

42 == "the answer" & 10 > 50
[1] FALSE

Two false statements, combined with |, return FALSE, because it is NOT true that at least one of them is true.

42 == "the answer" | 10 > 50
[1] FALSE

One true and one false statement, combined with &, return FALSE, because it is NOT true that all of them are true.

23 > 12 & 42 == "the answer"
[1] FALSE

One true and one false statement, combined with |, return TRUE, because it is true that at least one of them is true.

23 > 12 | 42 == "the answer"
[1] TRUE

To see how this works, let’s filter anx_data to keep only participants that saw the stats MCQs, OR that scored 3 or higher on the first STARS test subscale item.

This requires two separate statements, combined with | “OR”:

1anx_data |>
2  dplyr::filter(
3    mcq == "stats" |
4     stars_test1 >= 3
    ) 
1
Take the dataset anx_data, and then
2
Filter it keeping only the cases where the following assertion is true:
3
The value in the mcq variable is only and exactly equal to "stats", OR
4
The value in stars_test1 is greater than or equal to 3.

To evaluate this output, have a look at the mcq and stars_test1 variables. The mcq variable will still contain both stats and maths, but where it contains maths, the value in the stars_test1 variable should be 3 or more.

Exercises

Exercise

Produce a subset of anx_data that doesn’t contain any male participants.

anx_data |> 
  dplyr::filter(
    gender != "male"
  )
Exercise

Filter anx_data to keep only cases where the value of rmars_s_test2 is between 2 and 4.

Hint: You can use two separate assertions to do this, or check out dplyr::between().

For the first solution, we must use & “AND” to ensure that both these conditions are met simultaneously.

For the second solution, the dplyr::between() function does the same operation, without having to worry about getting AND vs OR right.

anx_data |> 
  dplyr::filter(
    rmars_s_test2 >= 2 & rmars_s_test2 <= 4
  )

anx_data |> 
  dplyr::filter(
    dplyr::between(rmars_s_test2, 2, 4)
  )

Data Cleaning

Filtering is absolutely invaluable in the process of data cleaning. In order to practice this process, I’ve introduced some messy values into the data, so let’s have a look at a method of cleaning up the dataset and documenting our changes as we go.

Thus far in this tutorial, we’ve been practicing filtering without storing our changes anywhere, but here we want our changes to “stick”. So, we are going to practice assigning the output of our piped commands to the same dataset name as we started with. This will overwrite the dataset with its updated version. Generally when you do data cleaning with this process, it’s a good idea to first draft your code WITHOUT assigning the output, so you can check your code and output carefully. Once you are confident that your code does only and exactly what you wanted it to do, assign the output to the dataset name to “save” your changes.

Pre-Exclusions

For data collected on platforms like Qualtrics, you can frequently test out your study via a preview mode. Responses completed via preview are still recorded in Qualtrics, but labeled as such in a variable typically called “DistributionChannel” or similar. In this dataset, we have a similar variable, distribution, that labels whether the data was recorded in a preview ("preview") or from real participants ("anonymous").

Your method may vary, but I wouldn’t bother to document these cases as “exclusions” because they aren’t real data. I would just drop them from the dataset - but of course make sure to record the code that does so.

1anx_data <- anx_data |>
2  dplyr::filter(
3    distribution == "anonymous"
    ) 
1
Overwrite the dataset anx_data with the following output: Take the dataset anx_data, and then
2
Filter it keeping only the cases where the following assertion is true:
3
The value in the distribution variable is exactly and only equal to “anonymous”.

We’ve seen this exact code structure before; the only difference from the examples above is that the output at the end of the pipe, including only “anonymous” cases, is assigned back to the same dataset name. This structure can be confusing at first. In the bit of code anx_data <- anx_data |> ..., the start of the pipe is actually the second anx_data, and the final, updated object is the first.

Recording Exclusions

As a part of complete and transparent reporting, we will want to report all of the reasons we excluded cases from our dataset, along with the number excluded. We can build this counting process into our workflow so that at the end, we have a record of each exclusion along with initial and final numbers.

Exercise

Follow along with the following data cleaning steps, trying them out in a code chunk for yourself as you go. You’ll need them at the end!

For each check below, our recording process will have two steps:

  1. Produce a dataset of the cases you will exclude, and count the number of rows (cases).
  2. Remove the cases and overwrite the old dataset with the new one.

In my process, I’m going to keep anx_data as the original, “raw” version of the dataset. So, I’ll create a copy in a new dataset object to use while “processing” that I will update as I go. This essentially functions as a save checkpoint; if I accidently overwrite anx_dat_proc while conducting my data cleaning, I can restart from this point rather from the very beginning of my code.

anx_data_proc <- anx_data

To begin, we will count the initial number of cases before any exclusions.

n_initial <- nrow(anx_data_proc)
n_initial
[1] 453

Remember that we can use nrow() because there is only one participant per row. If we had long-form data with observations from the same participant across multiple rows, we would have to do something a bit different!

Age

For low-risk ethics applications, you may want to exclude people who reported an age below the age of informed consent (typically 18). This may look like age >= 18 or similar in your dataset. However, it’s also important to check for errors or improbable ages, or to remove any participants that are too old if your study has an upper age limit. In this case, my hypothetical study didn’t have an upper age limit, but I’ll designate any ages as 100 or above as unlikely to be genuine responses.

Exercise

Store the counts of participants who are too young or too old in separate objects. Then, filter them out of the dataset.

## Store the number to be removed
n_too_young <- anx_data_proc |> 
  dplyr::filter(age < 18) |> 
  nrow()
n_too_young
[1] 22
n_too_old <- anx_data_proc |> 
  dplyr::filter(age >= 100) |> 
  nrow()
n_too_old
[1] 5
## Remove them
anx_data_proc <- anx_data_proc |> 
  dplyr::filter(
    dplyr::between(age, 18, 99)
  )

Missing Values

Finally (for now), just about any study will have to decide how to deal with missing values. The possibilities for your own work are too complex for me to have a guess at here, so for now we’ll only look at how to identify and remove missing values.

Single Variable

Let’s look at a single variable to begin with - for example, sticsa_trait_3. We can confirm that this variable has a/some NAs to consider by counting the unique values:

anx_data |> 
  dplyr::count(sticsa_trait_3)

Now that we know this variable does contain missing values, it’s time to remove them.

::: {.callout-note appearance=“minimal” title=“Exercise”}

Store the counts of participants missing a value for sticsa_trait_3 in a new object.

Hint: Look back on the Consent section for the function to get missing values.

The first thing you might think to try is to filter on sticsa_trait_3 == NA, but weirdly enough this doesn’t work. Instead, we again need the increasingly versatile is.na(), which again, we can think of as a question about whatever is in its brackets: “Is (this) NA?” Let’s see this in action:

anx_data_proc |>
  dplyr::filter(
    is.na(sticsa_trait_3)
  )

The logical assertion is.na(sticsa_trait_3) will return TRUE for cases where value in the sticsa_trait_3 variable IS missing (is NA).

These are the cases we want to remove, so we count how many there are and assign that number to a useful object name, as we did before.

n_sticsa_t3_missing <- anx_data_proc |>
  dplyr::filter(
    is.na(sticsa_trait_3)
  ) |> 
  nrow()

n_sticsa_t3_missing
[1] 3

Next, we need to actually exclude these cases. This time, we want to retain the inverse of the previous filtering requirement: that is, we only want to keep the cases that are NOT missing a value, the opposite of what we got from is.na(sticsa_trait_3). You may recognise “the inverse” or “not-x” as something we’ve seen before with !=, “not-equals”. For anything that returns TRUE and FALSE, you can get the inverse by putting an ! before it. (Try running !TRUE, for example!)

So, to create my clean anx_data_final dataset, I can use the assertion !is.na(sticsa_trait_3) to keep only the participants who answered this question - who do NOT have a missing value.

Finally, I can store the actual number of usable cases, according to my cleaning requirements, in a final object to use when reporting.

anx_data_final <- anx_data_proc |>
  dplyr::filter(
    !is.na(sticsa_trait_3)
  )

n_final <- nrow(anx_data_final)
n_final
[1] 390

Removing NAs is a tricky process, but if you’re sure that you want to drop all cases with missing values in your dataset, there are few helper functions to make this easy. However, this is a pretty major step and should be used with caution! If we didn’t check our data carefully, we could easily end up dropping a bunch of cases we didn’t want to get rid of.

For this, we’re going to leave filter() for a moment at look at a different function, tidyr::drop_na(). This function takes a tibble as input, and returns the same tibble as output, but with any rows that had missing values removed.

For example, if we apply it uncautiously here:

anx_data_proc |> 
  tidyr::drop_na()

Well, there goes all our data! Why has every single row in the dataset been dropped?

This is something we could work out without any R whatsoever, just using the codebook and a bit of View mode to confirm. The Codebook tells us that participants were in one of two independent conditions: "maths" or "stats". Because of the wide format of the data, there are mcq_maths questions that are always NA for people in the statistics-MCQ condition, and vice versa for the mcq_stats questions and people in the maths-MCQ condition. So, every single participant - even those who answered every question - has at least some missing values, and dropping NAs without checking just bins the whole dataset.

If I wanted to check this with R, I’d be hard pressed to do it with only what we’ve covered so far. Using the some extra challenge functions from the next tutorial, though, I’d do this:

anx_data |> 
  dplyr::mutate(
    ## Create a new variable containing the number of missing values in each row
    number_nas = rowSums(is.na(pick(everything())))
  ) |> 
  ## Count how many missing values there are
  dplyr::count(number_nas)

So, there’s at least 5 NAs in every single row, and when we call tidyr::drop_na(), every single row is dropped.

Reporting

Warning

This section is optional and demonstrates how to use inline code. It will be most useful if you expect to frequently use Quarto documents for reporting results as well as performing analyses/writing code.

Inline code is a way of using short bits of R code in the text (“in-line”) of a Quarto document. It takes the form `r some_r_code`, and when rendered, the code in some_r_code will be evaluated and printed in that place in the text.

As an example of how handy this can be, here’s how it might look to write up a short paragraph in Quarto describing the exclusions we just performed in this section.

The initial sample consisted of `r n_initial` cases. We removed `r n_no_consent` cases that did not consent, `r n_too_young` cases that reported an age below the ethical age of consent, and `r n_too_old` cases that reported improbable ages (100 years old or older). Finally, we removed `r n_sticsa_t3_missing` cases with who had not responded to the third trait item on the STICSA.This left us with a final sample of `r n_final` cases.

When you render your document, this should come out as:

The initial sample consisted of 453 cases. We removed 33 cases that did not consent, 22 cases that reported an age below the ethical age of consent, and 5 cases that reported improbable ages (100 years old or older). Finally, we removed 3 cases with who had not responded to the third trait item on the STICSA. This left us with a final sample of 390 cases.

There’s a huge advantage of this, namely ease of change. Imagine you had researchers from labs all over the world join the study and add a huge amount of new data to a massive collaborative dataset. In order to update all your numbers, all you have to do is update your initial anx_data dataset with the new cases, and then re-run all your code as is. Because these objects count whatever is in the data, they will automatically contain and record the correct numbers for the data you put into them4.

There are other advantages too - like confidence that you, a human person who may occasionally make errors (sorry, no offence meant!), won’t misread, mistype, or otherwise mistake the numbers, because at no point do you actually type a particular number yourself.

Nifty, eh?

Select

The select() function is probably the most straightforward of the core {dplyr} functions. Its primary job is to easily and transparently subset the columns within a dataset - in particular, a tibble. Rows are not affected by select(), only columns.

General Format

To subset a tibble, use the general format:

1dataset_name |>
2  dplyr::select(
3    variable_to_include,
4    -variable_to_exclude,
5    keep_this_one:through_this_one,
6    new_name = variable_to_rename,
7    variable_number
  )
1
Take the dataset dataset_name, and then
2
Select the following variables:
3
The name of a variable to be included in the output. Multiple variables can be selected separated by commas.
4
The name of a variable to be excluded from the output. Use either an exclamation mark (!) or a minus sign (-) in front of each variable to exclude. Multiple variables can be dropped, separated by commas with a ! (or -) before each.
5
A range of variables to include in the output. All the variables between and including the two named will be selected (or dropped, with !(drop_this_one:through_this_one)).
6
Include variable_to_rename in the output, but call it new_name.
7
Include a variable in the output by where it appears in the dataset, numbered left to right. For example, “2” will select the second column in the original dataset.

Columns will appear in the output in the order they are selected in select(), so this function can also be used to reorder columns.

Selecting Directly

The best way to get the hang of this will be to give it a go, so let’s dive on in!

Exercise

Create a subset of anx_data that contains the following variables:

  • The participant’s age
  • The first variable in the original dataset
  • All of the STARS variables
anx_data |> 
  dplyr::select(
    age, 1,
    stars_test1:stars_help4
  )
Exercise

Create a subset of anx_data that contains the following variables:

  • All of the original variables but NOT distribution
  • mcq renamed condition
anx_data |> 
  dplyr::select(
    -distribution,
    condition = mcq
  )

That’s really all there is to it!

Or is it?5

Using

The real power in select(), and in many other {tidyverse} functions, is in a system of helper functions and notations collectively called <tidyselect>. The overall goal of “<tidyselect> semantics” (as you will see it referred to in help documentation) is to make selecting variables easy, efficient, and clear.

New to UGs

At UG level at Sussex, students are not taught about <tidyselect> in core modules. However, <tidyselect> is desperately useful and makes complex data wrangling/cleaning a lot faster and more efficient, especially (for instance) for questionnaires with similarly-named subscales, so would make for a great collaborative activity with supervisors. <tidyselect> is also briefly introduced in an optional final-year workshop on working with Qualtrics data, so some students may have seen it before.

These helper functions can be combined with the selection methods above in any combination. Some very convenient options include:

  • everything() for all columns
  • starts_with(), ends_with(), and contains() for selecting columns by shared name elements
  • where() for selecting with a function, described in the next (optional) section

Open the help documentation by running ?dplyr::select in the Console to see examples of how to use all of the <tidyselect> helper functions.

Rather than list examples of all the helper functions here, it’s best to just try them out for yourself.

Exercise

Select the variables in anx_data that have to do with state anxiety.

anx_data |> 
  dplyr::select(
    contains("state")
  )
Exercises

Select all the variables in anx_data that are NOT the R-MARS, R-MARS-S, or STICSA.

anx_data |> 
  dplyr::select(
    ## contains() also fine (in this case)
    !starts_with(c("rmars", "sticsa"))
  )
Exercises

CHALLENGE: Select all the stars variables but NOT the stars_m variables.

This one’s a bit tricky because both starts_with() and contains() will return both types of STARS variables, because of the way the variables are named. We’ll have to provide multiple statements combined with logical operators, like we did earlier with filter().

anx_data |> 
  dplyr::select(starts_with("stars") & !contains("_m_"))

 

Using Functions

Here There Be Lambdas

The material in this section isn’t covered in the live workshops. It’s included here for reference because it’s extremely useful in real R analysis workflows, but it won’t be essential for any of the workshop tasks.

If you want to skip it, jump down to the next section.

Let’s say we want to generate a summary table of the variables in our dataset. Before we can create our summary in the next tutorial, we may first want to produce a subset of our dataset that only contains numeric variables.

To do this, we can use the <tidyselect> helper function where(). This helper function lets us use any function that returns TRUE and FALSE to select columns. Essentially, we don’t have to select columns using name or position - we can use any criteria we want, as long as we have (or can create…!) a function that expresses that criteria.

Especially helpful here is the is.*() family of functions in base R. This group of functions all have the same format, where the * is a stand-in for any type of data or object, e.g. is.logical(), is.numeric(), is.factor() etc. (The very useful is.na() that we’ve seen with filter() above is also a member of this family.) These functions work like a question about whatever you put into them - for example, is.numeric() can be read as, “Is (whatever’s in the brackets) numeric data?”

Tip

You can quickly find all of the functions in this family by typing is. in a code chunk and pressing Tab.

Putting these two together, we could accomplish the task of selecting only numeric variables as follows:

anx_data |> 
  dplyr::select(
    where(is.numeric)
  )

This command evaluates each column and determines whether they contain numeric data (TRUE) or not (FALSE), and only returns the columns that return TRUE.

Using Custom Functions

The function in where() that determines which columns to keep doesn’t have to be an existing named function. Another option is to use a “purrr-style lambda” or formula (a phrase you may see in help documentation) to write our own criteria on the spot.

For example, let’s select all of the numeric variables that had a mean of 3 or higher:

anx_data |>
  dplyr::select(
    where(~is.numeric(.x) & mean(.x, na.rm = TRUE) >= 3)
  )

Instead of just the name of a function, as we had before, we now have a formula inside where(). This formula has a few key characteristics:

  • The ~ (apparently pronounced “twiddle”!) at the beginning, which is a shortcut for the longer function(x) ... notation for creating functions.
  • The .x, which is a placeholder for each of the variables that the function will be applied to.

So, this command can be read: “Take the anx_data dataset and select all the columns where the following is true: the data type is numeric AND the mean value in that column is greater than or equal to 3 (ignoring missing values).”

Exercise

CHALLENGE: Select the variables in anx_data that are character type, or that do NOT contain any missing values.

Hint: You may need to use function(s) that we haven’t covered in the tutorials so far to solve this.

This one is a doozy! Very well done if you worked it out, either using your own solution or one like this, or if you got partway there.

anx_data |> 
  dplyr::select(
    where(~ is.character(.x) | all(!is.na(.x)))
  )

Here’s the process to understand/solve this using this particular solution.

The first half of the formula in where() should be okay - you may have noticed the <chr> label in the tibble output and/or guessed that there might be an is.*() function for this purpose.

The second half is a bit rough. You may have tried !is.na(.x) and got an error, namely: Predicate must return TRUE or FALSE, not a logical vector. In other words, this has to return a SINGLE logical value, and is.na() will return a vector containing logical values for each individual value in the variable.

To solve this - at least the way I’ve done - you need the {base} function all(), which answers the question, “Are all of these values TRUE?” It also has a (non-identical) twin any(), which (as you might guess) answers the question, “Are any of these values TRUE?” So, all() does a similar job as AND, and any() a similar job as OR.

To see what I mean, let’s just try it out:

all(TRUE, TRUE)
[1] TRUE
all(TRUE, FALSE)
[1] FALSE
all(FALSE, FALSE)
[1] FALSE
any(TRUE, TRUE)
[1] TRUE
any(TRUE, FALSE)
[1] TRUE
any(FALSE, FALSE)
[1] FALSE

Like AND and OR, all() and any() only give different responses when there are a mix of TRUEs and FALSEs. For this task, we only wanted to retain variables where ALL of the values produced by !is.na(x) were TRUE - that is, it was true that ALL of the values in that variable do NOT contain NAs. So, we wanted all(). This returns a single TRUE or FALSE value for each variable that dplyr::select() can use.

Quick Test

In the very first of these tutorials, we ran a t-test using some vectors of made-up data. We’re going to do the same thing again, but look at how we can use real data in a dataset instead.

Exercise

Bring up the help documentation for t.test() and use it to run a t-test comparing responses for the first item of the STICSA post-test state anxiety scale between people who saw the maths MCQs vs the stats MCQs.

Hint: Try using the formula notation - see formula under Arguments.

Call up the help documentation in the Console:

?t.test
help(t.test)

Run the test:

## Pipe method
anx_data |> 
  t.test(sticsa_post_state_1 ~ mcq, data = _)

    Welch Two Sample t-test

data:  sticsa_post_state_1 by mcq
t = 0.88618, df = 448.97, p-value = 0.376
alternative hypothesis: true difference in means between group maths and group stats is not equal to 0
95 percent confidence interval:
 -0.09741366  0.25741366
sample estimates:
mean in group maths mean in group stats 
               2.00                1.92 
## Single command method
t.test(sticsa_post_state_1 ~ mcq, data = anx_data)

    Welch Two Sample t-test

data:  sticsa_post_state_1 by mcq
t = 0.88618, df = 448.97, p-value = 0.376
alternative hypothesis: true difference in means between group maths and group stats is not equal to 0
95 percent confidence interval:
 -0.09741366  0.25741366
sample estimates:
mean in group maths mean in group stats 
               2.00                1.92 

There are a lot of options in the t.test() function, which can be used, through different arguments, to run almost any variety of t-test you can think of. Here, the “stats” and “maths” groups are independent groups, so we can mostly go with the defaults. The formula takes the form outcome ~ predictor, where the predictor is the grouping variable.

Note that the output mentions “Welch Two Sample t-test”, which is a version of the test that does not assume equal variances. This is the version that is taught to undergraduates, because we have not at this point introduced the process of assumption testing. If you definitely know that the variances are equal and you definitely want Student’s t-test, you can instead change the default setting.

Exercise

CHALLENGE: Using the help documentation, re-run the t-test with equal variances assumed.

t.test(sticsa_post_state_1 ~ mcq, 
       data = anx_data, 
       var.equal = TRUE)

    Two Sample t-test

data:  sticsa_post_state_1 by mcq
t = 0.88619, df = 449, p-value = 0.376
alternative hypothesis: true difference in means between group maths and group stats is not equal to 0
95 percent confidence interval:
 -0.09741222  0.25741222
sample estimates:
mean in group maths mean in group stats 
               2.00                1.92 

Finally, we may want to report this t-test result. We could do this manually with inline code, but as a cheeky shortcut we’ll make use of the {report} package from {easystats}, which are a coherent and interconnected family of packages like {tidyverse}. It really is this easy:

mcq_ttest <- t.test(sticsa_post_state_1 ~ mcq, 
       data = anx_data, 
       var.equal = TRUE)

report::report(mcq_ttest)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Two Sample t-test testing the difference of sticsa_post_state_1 by mcq
(mean in group maths = 2.00, mean in group stats = 1.92) suggests that the
effect is positive, statistically not significant, and very small (difference =
0.08, 95% CI [-0.10, 0.26], t(449) = 0.89, p = 0.376; Cohen's d = 0.08, 95% CI
[-0.10, 0.27])

You may not want to use this exact text as it is in your own papers, but it makes a nicely readable bit of output rather than a jumble of numbers.

 

Whew, that’s it! Well done getting through all that. In the next tutorial, we will look at more {dplyr} powerhouses to round out your data wrangling toolkit.

Footnotes

  1. Sorry 😅↩︎

  2. This incredibly useful property is called “data masking”. If you want to know more, run vignette("programming") in the Console.↩︎

  3. I’m not wild about this example - the experiences of non-binary and other genders are just as important! Unfortunately it’s the only variable in the dataset with the right number of categories.↩︎

  4. I’m sure Jenny would tell you there’s a little more to it than that, especially with 12,570 students from 100 universities in 35 countries, collected in 21 languages! But that’s both the dream and the general idea.↩︎

  5. Have you seen the size of this tutorial?? Of course it isn’t!↩︎